Get started with Wuzzy, a decentralized web crawling and search system built on the AO (Actor Oriented) protocol. This guide will walk you through setting up your first search index and crawler.
First, clone the Wuzzy AO repository and install it:
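A minimal sketch, assuming an npm-based project (substitute the actual repository URL):

```sh
git clone <wuzzy-ao-repository-url>
cd wuzzy-ao
npm install
```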
Next, we'll bundle the source code to prepare it for deployment:
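Assuming the bundler is exposed as an npm script named `bundle`:

```sh
npm run bundle
```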
The `bundle` command creates a `dist` directory containing the bundled Lua code, ready for deployment.
Then, start an AOS session in your terminal, filling in your Hyperbeam node URL of choice and giving the process a name:
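For example (recent aos releases use `--mainnet` to target a Hyperbeam node; check `aos --help` if your version differs):

```sh
aos wuzzy-nest --mainnet https://<your-hyperbeam-node>
```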
Important: When prompted to choose a runtime environment, select `hyper-aos`. Wuzzy relies on the `~relay@1.0` device to issue web requests.
This creates a new AO process that will serve as your Wuzzy Nest (search index). Take note of the process ID as we'll need it later.
Once aos has loaded and you see the prompt, you can load the Wuzzy Nest Lua code into your process:
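Assuming the bundle step produced `dist/nest.lua` (adjust the filename to whatever your `dist` directory contains):

```lua
.load dist/nest.lua
```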
Your Nest is now initialized and ready to index documents from crawlers and provide search functionality!
In a second terminal, open up another AOS process:
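For example, naming this one after the crawler:

```sh
aos wuzzy-crawler --mainnet https://<your-hyperbeam-node>
```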
Once aos has loaded and you see the prompt, you can load the Wuzzy Crawler Lua code into your process:
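Again assuming the bundled filename:

```lua
.load dist/crawler.lua
```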
After the Lua code has loaded, we'll need to set the crawler's `NestId` so that it knows where to submit the documents it crawls.
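One plausible way to do this from the crawler's aos prompt, assuming the module reads a top-level `NestId` variable (check the Wuzzy source for the exact mechanism):

```lua
NestId = "<your-nest-process-id>"
```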
Back in the Wuzzy Nest process, we'll have to grant permission for the new crawler to submit documents:
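For example, from the Nest's aos prompt. This uses the `Update-Roles` message covered in the Access Control section below; the role name and payload shape here are assumptions:

```lua
Send({
  Target = ao.id, -- the Nest process itself
  Action = "Update-Roles",
  Data = require('json').encode({
    Grant = {
      ["<crawler-process-id>"] = { "Submit-Documents" } -- hypothetical role name
    }
  })
})
```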
You can repeat this step to spawn multiple crawlers.
Add URLs for your Crawler to process. URLs can use the http, https, arns, or ar protocol schemes:
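For example, from the crawler's aos prompt (the `Add-Crawl-Tasks` action name is an assumption; check the crawler source for the handler it actually registers):

```lua
Send({
  Target = ao.id,
  Action = "Add-Crawl-Tasks",
  Data = "https://example.com\narns://docs.ar.io"
})
```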
Multiple tasks can be added by separating them with a newline.
Supported URL formats:

- `http://domain.com` - HTTP
- `https://secure-domain.com` - HTTPS
- `arns://domain.ar.io` - Arweave Name System domains
- `ar://transaction-id` - Direct Arweave transaction URLs

As `Cron` is not currently available to `hyper-aos`, we can send `Cron` messages ourselves, either manually or from an automated script.
For example, from inside the crawler process, you can trigger `Cron` manually:
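```lua
-- A self-addressed message; assumes the crawler's handler matches on Action = "Cron"
Send({ Target = ao.id, Action = "Cron" })
```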
You'll see output showing that your crawler has added its crawl tasks to the queue and has requested the first task from the relay device.
When it receives a `Cron` message, the Crawler first checks whether there are any items in the Crawl Queue.
If there are no items in the Crawl Queue, the Crawler populates the Crawl Queue with all items in its Crawl Tasks.
If there are items in the Crawl Queue, the Crawler pops the next item from the queue and requests the URL from the relay device.
When it receives a response from the relay device, the Crawler parses the HTML for text content, meta description, title, and any links the page contains. If a link falls under a domain in the crawler's Crawl Tasks, it is added to the queue.
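An illustrative sketch of that flow, not the actual Wuzzy source; `CrawlTasks`, `CrawlQueue`, and `requestFromRelay` are stand-in names:

```lua
Handlers.add("cron",
  Handlers.utils.hasMatchingTag("Action", "Cron"),
  function(msg)
    -- Refill the queue from the configured tasks when it runs dry
    if #CrawlQueue == 0 then
      for _, task in ipairs(CrawlTasks) do
        table.insert(CrawlQueue, task)
      end
    end
    -- Pop the next URL and request it via the ~relay@1.0 device
    local url = table.remove(CrawlQueue, 1)
    if url then
      requestFromRelay(url) -- hypothetical helper wrapping the relay call
    end
  end
)
```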
Currently, the Crawler will index `text/html` and `text/plain` content types.
In a future update, the Crawler will be able to identify other content types and forward them to an appropriate Classifier for analysis (e.g. images, video, audio).
In another future update, the Crawler will be able to send scraped text content to LLM Classifiers, which can subsequently update the document in the Nest to include semantic descriptions, further empowering search capabilities.
Once your Wuzzy Nest has some indexed content, you can search it by using Hyperbeam's HTTP API and the Wuzzy Nest view module.
View module id: `NWtLbRjMo6JHX1dH04PsnhbaDq8NmNT9L1HAPo_mtvc`
To issue a BM25 search for query "hyperbeam":
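The request below is a sketch only: the Hyperbeam path syntax, the `search` function name, and the `query` parameter are assumptions based on typical Hyperbeam device pathing; consult your node's documentation for the exact form.

```sh
# Hypothetical path shape: evaluate the view module against the Nest process via the lua device
curl "https://<hyperbeam-node>/<nest-process-id>~process@1.0/now/~lua@5.3a&module=NWtLbRjMo6JHX1dH04PsnhbaDq8NmNT9L1HAPo_mtvc/search?query=hyperbeam"
```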
Search results are returned as a flat JSON structure:
By default, only 10 results are returned, but you can request subsequent pages by adding the `from` query parameter:
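For example, to fetch the second page of ten results (same hypothetical path shape as above):

```sh
curl "https://<hyperbeam-node>/<nest-process-id>~process@1.0/now/~lua@5.3a&module=NWtLbRjMo6JHX1dH04PsnhbaDq8NmNT9L1HAPo_mtvc/search?query=hyperbeam&from=10"
```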
Search queries are case-insensitive. Results include the surrounding text context (typically ~100 characters before and after each match) and wrap matches in HTML tags for highlighting.
Using the Hyperbeam HTTP API, you can check the status of your Nest with the same view module from the previous step, calling the `nest_info` function:
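Again a sketch, reusing the hypothetical path shape from the search examples; only the `nest_info` function name comes from the module itself:

```sh
curl "https://<hyperbeam-node>/<nest-process-id>~process@1.0/now/~lua@5.3a&module=NWtLbRjMo6JHX1dH04PsnhbaDq8NmNT9L1HAPo_mtvc/nest_info"
```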
You'll get a flat JSON structure in response with information about the Nest's stats, its crawlers, and the documents it contains:
Similarly, you can use the Hyperbeam HTTP API with the Crawler view module, calling the `crawler_info` function:
Crawler view module id: `ZK1AXFffVJ2XNNIt5-s6NsI7r_nrsatoRdHyqSKs6xk`
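As before, a hypothetical request shape:

```sh
curl "https://<hyperbeam-node>/<crawler-process-id>~process@1.0/now/~lua@5.3a&module=ZK1AXFffVJ2XNNIt5-s6NsI7r_nrsatoRdHyqSKs6xk/crawler_info"
```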
You'll get a flat JSON structure in response with information about the Crawler, its Crawl Tasks, the current Crawl Queue, and any Crawled URLs it is retaining in its Crawl Memory:
Both the Wuzzy Nest and Wuzzy Crawler contain ACL functionality. By default, the owner has access to all actions. You can authorize another user or process to perform an action by sending an `Update-Roles` message to the Nest or Crawler:
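A sketch from the owner's aos prompt; the `Update-Roles` action comes from the text above, but the payload shape and role names are assumptions:

```lua
Send({
  Target = "<nest-or-crawler-process-id>",
  Action = "Update-Roles",
  Data = require('json').encode({
    Grant = {
      -- hypothetical shape: map of address to list of permitted actions
      ["<user-or-process-id>"] = { "<action-name>" }
    }
  })
})
```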
You can also add a user or process as an `admin`, which allows them to perform all actions:
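Using the same assumed payload shape, granting the `admin` role:

```lua
Send({
  Target = "<nest-or-crawler-process-id>",
  Action = "Update-Roles",
  Data = require('json').encode({
    Grant = { ["<user-or-process-id>"] = { "admin" } }
  })
})
```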
Crawler not indexing content:
Search returns no results:
Permission errors: